Data-driven blood glucose level prediction in type 1 diabetes: a comprehensive comparative analysis

Accurate prediction of blood glucose level (BGL) has proven to be an effective way to help in type 1 diabetes management. The choice of input, along with the fundamental choice of model structure, is an existing challenge in BGL prediction. Investigating the performance of different data-driven time series forecasting approaches with different inputs is beneficial in advancing BGL prediction performance. Limited work has been done in this regard, and it has reached differing conclusions. This paper performs a comprehensive investigation of different data-driven time series forecasting approaches using different inputs. To do so, BGL prediction is comparatively investigated from two perspectives: the model's approach and the model's input. First, we compare the performance of BGL prediction using different data-driven time series forecasting approaches, including classical time series forecasting, traditional machine learning, and deep neural networks. Second, for each prediction approach, univariate input, using BGL data only, is compared to multivariate input, using data on carbohydrate intake, injected bolus insulin, and physical activity in addition to BGL data. The investigation is performed on two publicly available Ohio datasets. Regression-based and clinical-based metrics along with statistical analyses are used for evaluation and comparison purposes. The outcomes show that the traditional machine learning model is the fastest model to train and has the best BGL prediction performance, especially when using multivariate input. The results also show that simply adding extra variables does not necessarily improve BGL prediction performance significantly, and data fusion approaches may be required to effectively leverage other variables' information.

It is essential to maintain a normal blood glucose level (BGL) when managing type 1 diabetes mellitus (T1DM) 1.
To aid this, one application of artificial intelligence is to predict the BGL of individuals with T1DM utilising the current and past information 2. An early warning system for insufficient glycaemic control can be provided by BGL prediction 3. However, this prediction is challenging because of physiological factors such as the delay in food and insulin absorption, considerable variation between and within patients, and the complexity of interference factors such as physical activity 4,5. Hence, despite all the research performed in the field of BGL prediction, accurate predictions remain a challenge 6.
Based on the model structure and knowledge requirements, there are three main types of BGL prediction algorithms: physiological models (extensive knowledge), hybrid models (intermediate knowledge), and data-driven models (black-box approaches) 2,7,8. Data-driven models have attracted considerable attention and are being increasingly explored. These models can be classified into classical time series forecasting (CTF), traditional machine learning (TML), and deep neural network (DNN) approaches. Comparing the efficacy of various data-driven prediction models using different approaches would be beneficial in the advancement of BGL prediction performance. However, using different datasets, different inputs, and different model settings has made this comparison difficult, and limited studies have been published in this regard. Xie and Wang 6 benchmarked a classical autoregression with an exogenous input model against ten different machine learning models for BGL prediction in T1DM patients. Zhang et al. 9 also compared four different data-driven models to forecast BGL

Material and methods
This section gives a brief description of the datasets used, the data preprocessing steps, and the prediction models developed from different time series forecasting approaches. Subsequently, the applied evaluation and statistical analyses are presented.

Dataset
According to the review performed by Felizardo et al. 32, the Ohio T1DM dataset 30,31 is the most frequently used publicly accessible clinical dataset in the literature and supports replication. Hence, to make a reliable comparison, we used the Ohio T1DM dataset in this study. The Ohio T1DM dataset comprises two sets of data from 12 people with T1DM. The first dataset, related to six T1DM patients, was released in 2018 for the first BGL prediction challenge 33 (called Ohio_2018). The second dataset, related to an additional six patients, was released in 2020 for the second BGL prediction challenge 34 (called Ohio_2020). Data contributors comprised five females and seven males and were aged 20 to 80 years at data collection time. Table 1 provides the details related to the gender and age range of participants in both cohorts.
The patients used an insulin pump, a CGM sensor, and a fitness band. Along with the physiological sensor data, each individual reported carbohydrate (Carb) estimations, bolus insulin (Bolus), and life events. Participants in both cohorts used a Medtronic Enlite CGM sensor for measuring their BGL. In the Ohio_2018 dataset, patients wore Basis Peak fitness bands that collected heart rate (HR) data, and patients in the Ohio_2020 cohort wore Empatica Embrace fitness bands that collected magnitude of acceleration (MA) data. Data were collected over an eight-week period; the last 10 days were allocated to the testing sets and the rest to the training sets. BGL data from the CGM sensors and HR data from the Basis Peak band were collected with a 5-minute aggregation. MA data from the Empatica Embrace band were collected every minute. Further information about the data collection can be found in 30,31. In this study, automatically collected BGL and activity data and self-reported Carb and Bolus data are used.

Preprocessing
Some preprocessing steps were mandatory to overcome the imperfections and missing data that arise when analysing real-world data. Additionally, some data preprocessing was required depending on the forecasting approach used (Fig. 1).

Imputation and alignment
The initial preprocessing step was to address the issue of missing BGL and physical activity data. These missing values were filled linearly, by interpolation in the training sets and by extrapolation in the testing sets. Timestamps with no reported Carb or Bolus data were assigned a value of zero. The next preprocessing step was to align the BGL data with the other data. The MA data, with a resolution of one minute, were downsampled to a resolution of five minutes by keeping the MA data point nearest to each BGL data point and removing the remainder. The HR data, which had the same resolution as the BGL data, only needed to be aligned. Additionally, the timestamps with unavailable data at the beginning and end of each set, which occurred because the sensors were worn at different times, were discarded.
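The imputation and alignment steps described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy values, and the use of minutes as timestamps are assumptions made for illustration, and only the interpolation case (training sets) is shown, not the extrapolation used for testing sets.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def downsample_nearest(target_times, source_times, source_values):
    """For each target timestamp, keep the source sample closest in time."""
    picked = []
    for t in target_times:
        nearest = min(range(len(source_times)),
                      key=lambda i: abs(source_times[i] - t))
        picked.append(source_values[nearest])
    return picked


# Toy data (hypothetical): timestamps are minutes since the start of recording.
bgl_times = [0, 5, 10, 15]                         # five-minute CGM grid
bgl = interpolate_linear([100.0, None, None, 130.0])

ma_times = list(range(0, 16))                      # one-minute activity samples
ma = [float(t) for t in ma_times]
ma_5min = downsample_nearest(bgl_times, ma_times, ma)

carb_events = {10: 45.0}                           # self-reported, keyed by time
carb_aligned = [carb_events.get(t, 0.0) for t in bgl_times]  # zero if unreported
```

In this sketch every series ends up on the same five-minute grid as the BGL readings, which is the property the later multivariate models rely on.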

Stationarity
When applying the CTF approach, two common statistical tests were applied to check the primary assumption of stationarity 35: the Augmented Dickey-Fuller (ADF) test 36 and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test 37. A time series was defined as stationary only if both tests confirmed stationarity. Since the ADF test indicated stationarity for all variables and all patients, integrated differencing was applied to the time series for which the KPSS test indicated non-stationarity.
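Because the two tests have opposite null hypotheses, the combined decision rule is easy to get backwards. The sketch below shows one way to express it, assuming the p-values have already been obtained from some ADF and KPSS implementation; the function names and the 0.05 threshold are illustrative assumptions, not taken from the paper's code.

```python
def is_stationary(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Declare stationarity only when both tests agree.

    ADF null hypothesis: the series has a unit root (non-stationary),
    so a small p-value supports stationarity.
    KPSS null hypothesis: the series is stationary,
    so a small p-value indicates non-stationarity.
    """
    adf_says_stationary = adf_pvalue < alpha
    kpss_says_stationary = kpss_pvalue >= alpha
    return adf_says_stationary and kpss_says_stationary


def difference(series, order=1):
    """Apply first differencing `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Hypothetical p-values: ADF rejects a unit root, KPSS rejects stationarity,
# so the series would be differenced before fitting.
needs_differencing = not is_stationary(adf_pvalue=0.01, kpss_pvalue=0.01)
once_differenced = difference([1, 4, 9, 16])
```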

Reframing
When applying TML or DNN approaches, the multi-step-ahead time series forecasting problem should be reframed as a supervised learning task. To accomplish this, historical observations were used as inputs, and future observations were used as outputs.
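A sliding window is a common way to perform this reframing. The following sketch is illustrative only: the function name and the window length `n_lags` are hypothetical, since the history length is not stated in this section. With five-minute sampling, the 30 and 60 min horizons would correspond to 6 and 12 steps.

```python
def make_supervised(series, n_lags, horizon):
    """Turn a series into (X, y) pairs.

    Each row of X holds `n_lags` past observations and the matching row
    of y holds the next `horizon` observations.
    """
    X, y = [], []
    for end in range(n_lags, len(series) - horizon + 1):
        X.append(series[end - n_lags:end])
        y.append(series[end:end + horizon])
    return X, y


# Toy BGL readings (hypothetical), three lags and a two-step horizon.
series = [100, 104, 109, 115, 120, 124, 127, 129]
X, y = make_supervised(series, n_lags=3, horizon=2)
```

For a multivariate input, the same windowing could be applied to each aligned variable and the windows concatenated per sample, which is one reading of the "vectorised" input mentioned for the TML model.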

Time series forecasting approaches
To comprehensively investigate and compare the performance of BGL prediction, different time series forecasting categories, including CTF, TML, and DNN, were examined. Also, following the BGL prediction challenges, in which the participants were asked to predict BGL 30 and 60 min in advance, and in line with many papers in the literature 6,18,25, prediction horizons of 30 and 60 min were considered. There is a pool of models for BGL prediction in each category. For the sake of feasibility and to minimise the complexity of comparison, a common successful model found in the literature was developed and fine-tuned as a representative of each category. For input comparison purposes, each model was first trained as a univariate prediction model; then, its counterpart was developed as a multivariate prediction model. The prediction models are briefly described in the following.

Classical time series forecasting (CTF)
CTF is a common approach for the BGL prediction task 6,38. One of the most commonly used models in this category is the autoregressive integrated moving average (ARIMA) 39. ARIMA is a combination of linear processes of autoregression (AR) and moving average (MA) models, as well as integrated differencing. It models the future as a linear combination of lags and lagged residual errors in a differenced time series in the case of non-stationarity.
To develop an ARIMA model, the parameters of the model, including p (AR order), d (differencing order), and q (MA order), should be determined. The p and q parameters were optimised for each patient automatically by examining each parameter from zero to 36. The d parameter was also determined by considering the stationarity
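For reference, the verbal description above corresponds to the textbook ARIMA(p, d, q) form below. The notation is an assumption of this note rather than the paper's: B is the backshift operator, y'_t the d-times differenced series, c a constant, phi_i and theta_j the AR and MA coefficients, and epsilon_t the residual error.

```latex
y'_t = (1 - B)^d \, y_t ,
\qquad
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i}
         + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
         + \varepsilon_t
```

With d = 0 the model reduces to ARMA(p, q) on the original series, which is the case for series that both stationarity tests accept.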

Traditional machine learning (TML)
The TML approach has also received significant attention for predicting BGL. Support vector machines (SVMs) have been shown to be the most accurate in the BGL prediction task among different classes of machine learning algorithms 5,40. Also, among different types of SVMs, support vector regression (SVR) is the most commonly employed technique for predicting BGL 5. In this study, in line with the successfully developed SVM model for BGL prediction in the literature 41, an SVR model with a radial basis kernel was developed. Moreover, to obtain a multivariate prediction using SVM, the multivariate data were vectorised and used as the input of the multivariate counterpart. The hyperparameters of the SVR model, including gamma, C, and epsilon, were chosen using a grid search during a tuning process for each patient and each input. Search spaces of {0.1, 1, 10, 100}, {0.001, 0.01, 0.1, 1}, and {0.01, 0.1, 1, 10} were explored to optimise the gamma, C, and epsilon parameters, respectively. The chosen parameters are summarised in Table 3.
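The grid search over the three stated search spaces amounts to evaluating 4 x 4 x 4 = 64 candidate settings per patient and input. The sketch below enumerates them; the scoring function is a hypothetical stand-in for a validation error, because the actual fitting and validation procedure is not described at this level of detail.

```python
from itertools import product

# Search spaces as stated in the text.
GAMMA = [0.1, 1, 10, 100]
C = [0.001, 0.01, 0.1, 1]
EPSILON = [0.01, 0.1, 1, 10]


def grid_search(score_fn):
    """Evaluate every (gamma, C, epsilon) triple; return the best and the count."""
    candidates = list(product(GAMMA, C, EPSILON))
    best = min(candidates, key=lambda params: score_fn(*params))
    return best, len(candidates)


def toy_score(gamma, c, epsilon):
    """Hypothetical validation error; a real run would fit an SVR here."""
    return abs(gamma - 1) + abs(c - 0.1) + abs(epsilon - 0.1)


best_params, n_candidates = grid_search(toy_score)
```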

Deep neural network (DNN)
As a class of recurrent neural networks, LSTM networks are effective at predicting BGL based on sequential data 42–45. In this study, the sequence-to-sequence forecasting task was carried out using an LSTM model recently developed by our team, which has been optimised on the Ohio datasets 13,21. The vanilla LSTM network consisted of an LSTM layer, a dense layer, and an output layer. The He uniform initialiser, the ReLU activation function, the Adam optimiser, and the mean square error loss function were chosen. Also, 200 epochs and a batch size of 32 were selected. An initial learning rate of 0.01 was reduced by a factor of 0.1 using a ReduceLROnPlateau callback with a patience of 20 once the validation loss stopped improving.
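The learning-rate schedule can be pictured with a small simulation. This is a simplified imitation of the plateau logic, written in plain Python under the stated settings (initial rate 0.01, factor 0.1, patience 20); it ignores options such as cooldown, minimum delta, and minimum learning rate that real callback implementations offer, and the loss values are made up.

```python
def simulate_reduce_on_plateau(val_losses, initial_lr=0.01, factor=0.1, patience=20):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss          # improvement: reset the patience counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor     # plateau detected: shrink the learning rate
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation loss: improves for 5 epochs, then stays flat for 40.
losses = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.6] * 40
lrs = simulate_reduce_on_plateau(losses)
```

In this toy run the rate stays at 0.01 until 20 consecutive epochs pass without improvement, drops to 0.001, and drops again to 0.0001 after a further 20 flat epochs.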

Evaluation criteria
In this work, two types of evaluation criteria, regression-based and clinical-based, were examined to comprehensively investigate BGL prediction performance based on different prediction approaches and inputs. The following subsections provide a brief description of these criteria.

Regression-based criteria
Using Eqs. (1) and (2), the overall performance of BGL prediction models was evaluated based on root mean square error (RMSE) and mean absolute error (MAE), two commonly used regression accuracy metrics in BG-related works 46–49.

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \tag{2}$$

In both equations, N represents the evaluation set size, y_i represents the reference, and ŷ_i represents the prediction.
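As a worked illustration of the two regression metrics, the sketch below computes RMSE and MAE with their standard definitions on made-up reference and predicted values; the function and variable names are illustrative.

```python
import math


def rmse(reference, predicted):
    """Root mean square error over paired reference/prediction values."""
    n = len(reference)
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(reference, predicted))
    return math.sqrt(squared / n)


def mae(reference, predicted):
    """Mean absolute error over paired reference/prediction values."""
    n = len(reference)
    return sum(abs(y - y_hat) for y, y_hat in zip(reference, predicted)) / n


# Hypothetical BGL values in mg/dL; the errors are -10, 10, 0 and -20.
reference = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 200.0, 270.0]
rmse_value = rmse(reference, predicted)   # sqrt(600 / 4) = sqrt(150)
mae_value = mae(reference, predicted)     # 40 / 4 = 10
```

RMSE exceeds MAE here because squaring gives the single 20 mg/dL miss more weight, which is why the two metrics are usually reported together.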

Clinical-based criteria
The clinical performance of each model was evaluated using the Matthews correlation coefficient (MCC) and surveillance error (SE), which have recently been used for clinical evaluation of BGL prediction 18,43,44. The MCC criterion was used to measure whether the models could accurately distinguish adverse glycaemic events from normoglycaemic events. The SE metric, computed as the average of the bilinearly interpolated surveillance error grid 50, assigned each patient a single score.
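The MCC computation can be sketched as a binary classification of each reading into adverse or normoglycaemic. The thresholds of 70 and 180 mg/dL used below are commonly cited limits and are an assumption of this sketch; the text does not state which thresholds were applied, and the sample values are made up.

```python
import math

HYPO, HYPER = 70.0, 180.0  # mg/dL; assumed thresholds, not stated in the text


def is_adverse(bgl):
    """True for a hypo- or hyperglycaemic reading under the assumed limits."""
    return bgl < HYPO or bgl > HYPER


def mcc(reference, predicted):
    """Matthews correlation coefficient for adverse-event detection."""
    tp = tn = fp = fn = 0
    for y, y_hat in zip(reference, predicted):
        actual, guess = is_adverse(y), is_adverse(y_hat)
        if actual and guess:
            tp += 1
        elif not actual and not guess:
            tn += 1
        elif guess:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical readings: two hits, one missed event, one false alarm, two correct normals.
reference = [60.0, 65.0, 100.0, 120.0, 190.0, 150.0]
predicted = [65.0, 75.0, 110.0, 185.0, 200.0, 140.0]
score = mcc(reference, predicted)
```

Unlike RMSE and MAE, this score ignores how far a prediction is from the reference and only asks whether both fall on the same side of the clinical thresholds.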

Statistical analyses
The BGL prediction performance measured by evaluation metrics with various prediction approaches or inputs was also statistically analysed over data contributors for each dataset.In accordance with the conditions of each comparison, appropriate statistical analyses were conducted.
To compare different prediction models, the Friedman test 51 was first conducted to find out whether at least two approaches differ significantly (with a significance level of five percent). If this was the case, the post-hoc Nemenyi test 52 was then performed, comparing the performance of the different approaches in a pair-wise fashion. Also, since multiple comparisons were made, the Holm procedure 53 was applied to correct the significance level. A critical difference (CD) diagram 29 was drawn to illustrate the results of each post-hoc test. These analyses were performed for univariate and multivariate inputs separately.
To compare univariate and multivariate inputs for each prediction approach, the non-parametric Wilcoxon signed-ranks test 54, which is an appropriate test for comparing two approaches without the assumption of normality, was applied 29. This test, with a significance level of five percent, was conducted to check the consistency of each evaluation metric calculated for univariate and multivariate inputs over the data contributors of each dataset. The comparison of inputs was performed for each prediction approach separately.
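The rank-based logic behind the Friedman test and the critical difference used in CD diagrams can be sketched as follows. The formulas are the standard ones for comparing k models over N subjects; the scores are made up, ties are not handled, the Holm correction is omitted, and the constant 2.343 is the commonly tabulated Nemenyi critical value for three models at the five percent level.

```python
import math


def average_ranks(scores):
    """scores[i][j] is the error of model j on patient i; rank 1 is the lowest error."""
    n_models = len(scores[0])
    totals = [0.0] * n_models
    for row in scores:
        order = sorted(range(n_models), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic computed from the average ranks (no ties)."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(n_patients, n_models, q_alpha=2.343):
    """Critical difference in average rank for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6 * n_patients))


# Hypothetical RMSE values for six patients; columns are CTF, TML, DNN.
scores = [
    [20.1, 18.0, 19.2],
    [22.4, 19.9, 21.0],
    [18.7, 17.1, 18.0],
    [25.3, 22.8, 24.1],
    [19.9, 18.4, 19.0],
    [21.6, 20.2, 20.9],
]
ranks = average_ranks(scores)
chi_square = friedman_statistic(scores)
cd = nemenyi_cd(n_patients=6, n_models=3)
```

In this toy case the average ranks are 3, 1, and 2 and the critical difference is about 1.35, so only the pair whose ranks differ by 2 would be declared significantly different; the other pairs would be joined by a horizontal line in a CD diagram.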
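With only six patients per cohort, the exact distribution of the signed-rank statistic is small enough to enumerate. The sketch below computes the statistic and an exact two-sided p-value by brute force; it assumes no zero or tied differences, uses made-up RMSE values, and is not necessarily the variant of the test that the authors' software applied.

```python
from itertools import product


def signed_rank_statistic(x, y):
    """Return (W_plus, W_minus) for paired samples, assuming no zeros or ties."""
    diffs = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = w_minus = 0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            w_plus += rank
        else:
            w_minus += rank
    return w_plus, w_minus


def exact_two_sided_p(x, y):
    """Exact two-sided p-value by enumerating all sign assignments of the ranks."""
    n = len(x)
    total = n * (n + 1) // 2
    observed = min(signed_rank_statistic(x, y))
    count = 0
    for signs in product([0, 1], repeat=n):
        w_pos = sum(rank for rank, s in enumerate(signs, start=1) if s)
        if min(w_pos, total - w_pos) <= observed:
            count += 1
    return count / 2 ** n


# Hypothetical per-patient RMSE: the multivariate input is better for all six patients.
univariate = [20.0, 18.5, 22.1, 19.4, 21.0, 17.8]
multivariate = [19.5, 18.2, 21.0, 19.0, 20.2, 17.7]
statistic = signed_rank_statistic(univariate, multivariate)
p_value = exact_two_sided_p(univariate, multivariate)

# Same data, but the patient with the smallest difference now does worse.
multivariate_one_worse = [19.5, 18.2, 21.0, 19.0, 20.2, 17.9]
p_one_worse = exact_two_sided_p(univariate, multivariate_one_worse)
```

Under these assumptions the smallest attainable two-sided p-value for six pairs is 2/64 = 0.03125, reached only when all six differences share a sign; a single patient moving the other way already gives 0.0625. This helps explain why few input comparisons reach significance at the five percent level.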

Results and discussion
In this section, the evaluation results for both the Ohio_2018 and Ohio_2020 datasets and the 30-minute and 60-minute prediction horizons are presented first. Then, depending on which factor is being compared, the results of the related statistical analyses are presented and discussed in two parts: comparing models' approaches and comparing models' inputs.

Evaluation results
Tables 4 and 5 provide the evaluation results for the BGL prediction models related to different approaches for both univariate and multivariate inputs, 30 and 60 min in advance in the Ohio_2018 dataset, respectively. Also, Tables 6 and 7 provide the evaluation results in the Ohio_2020 dataset for prediction horizons of 30 and 60 min, respectively. It is worth noting that for the DNN approach, due to the random initialisation, the average and standard deviation of the evaluation results over 10 runs are reported. Using the evaluation results, statistical analyses were performed to compare different models and inputs. The results are discussed in the following sections.
Moreover, to provide visual clinical insight, colour-coded surveillance error grids are illustrated in Figs. 2, 3, 4, 5, 6 and 7, which are related to different models and inputs for BGL prediction 30 min in advance for patient 570.

Comparing models' approaches
Different data-driven time series forecasting approaches are compared using univariate and multivariate inputs separately. Firstly, the results of the statistical analyses are presented and discussed. Secondly, the computational costs of the different models are compared. Then, based on all the presented results, a conclusion is drawn.

Univariate input
Table 8 presents p-values of the Friedman test calculated based on evaluation criteria using different BGL prediction approaches with a univariate input. The analysis was performed for both prediction horizons of 30 and 60 minutes, and for both Ohio_2018 and Ohio_2020 datasets, separately. With a significance level of five percent, p-values in bold font indicate the cases in which at least two models probably differ significantly in performance.
Reviewing Tables 4, 5, 6, 7, and 8, it can be concluded that although there are differences between the average evaluation metrics related to the performance of different prediction models over the data providers of each cohort, these differences are mainly statistically insignificant. Table 8 shows that just three metrics, RMSE, MAE, and SE, calculated for the 60-minute prediction horizon in the Ohio_2018 cohort may be significantly different between at least two prediction models. In those cases, the post-hoc Nemenyi test was performed for pair-wise comparisons between prediction models. The results of the Nemenyi tests are visualised using CD diagrams, as shown in Figs. 8, 9, and 10, according to the RMSE, MAE, and SE metrics, respectively. In each CD diagram, at a significance level of five percent, prediction models that differ insignificantly are linked by a horizontal line. It can be seen that while the TML model outperformed the CTF model significantly based on their average ranks for the examined metrics, the other pair-wise comparisons were not statistically meaningful.
Considering the presented results in Table 9 and a significance level of five percent, it can be inferred that among the different examined cases for comparing prediction approaches regarding evaluation metrics, prediction horizons, and datasets, at least two prediction approaches may perform differently for BGL prediction 60 minutes in advance in both Ohio_2018 and Ohio_2020 datasets based on all the evaluation metrics. Also, there [...] 9, the result of the Friedman test calculated based on the MCC metric in the Ohio_2020 dataset for the 60-minute prediction horizon was significant, Fig. 20 shows that for the mentioned case, there was not a significant difference between BGL prediction performance using different prediction models. Also,

Computational cost
When comparing different prediction models, the computational cost of retraining them needs to be considered. The developed models do not have indefinite validity, and readjustments are required following changes in the BGL patterns. The computational costs of the different prediction models were measured on a standard laptop computer with a Core i7 2.8 GHz processor, an NVIDIA GeForce GTX 1050 Ti GPU, and 16 GB of RAM. Table 10 shows the average training time for different models of all data contributors in each cohort for each input and

Summary
A review of the results presented in "Evaluation results", "Statistical result", and "Computational cost" shows that in more than half of the examined cases regarding evaluation metrics, prediction horizons, and datasets, especially using a univariate input, the three models performed comparably in BGL prediction. Among the rest of the cases, the TML model achieved the first rank with a significant superiority over at least one other model. In addition, the TML model was also the fastest model to train. The CTF and DNN models performed similarly for BGL prediction in all cases. Overall, the results suggest that the TML model is the superior approach for BGL prediction among the different examined data-driven models.

Comparing models' inputs
In this section, the effectiveness of univariate and multivariate inputs is compared using the CTF, TML, and DNN approaches separately. The outcomes of the statistical analyses are given and discussed first. Furthermore, a discussion about the ease and complexity of collecting and processing the different inputs is presented. The results are then summarised to draw conclusions.

Statistical result
CTF approach
Table 11 presents the Wilcoxon test p-values, based on each evaluation metric, prediction horizon, and cohort, for examining whether the BGL prediction performance of the CTF model differs statistically significantly using different inputs. With a significance level of five percent, the test outcomes show that exogenous variables did not affect the BGL prediction performance using the CTF model 60 min in advance in the Ohio_2018 dataset and both 30 and 60 min in advance in the Ohio_2020 dataset based on all evaluation metrics. There is only one statistically significant difference (marked with bold font) between univariate and multivariate inputs using the CTF model, which is related to the RMSE metric for predicting the BGL 30 min in advance in the Ohio_2018 dataset. Considering Tables 4, 5, and 11, it can be concluded that, based on the RMSE metric, the CTF model performed worse with exogenous variables compared to univariate BGL prediction 30 min in advance over patients in the Ohio_2018 dataset.

TML approach
Table 12 displays p-values of the Wilcoxon test for examining whether univariate or multivariate inputs can make a statistically significant difference in BGL prediction performance when applying the TML model. The test was performed over the data contributors of each cohort, based on each evaluation metric and for each prediction horizon separately. With a significance level of five percent, the test outcome showed that the TML model predicted BGL significantly differently using different inputs for patients in the Ohio_2018 dataset based on the SE metric for both prediction horizons. In contrast, the TML model performed similarly using different inputs in the Ohio_2020 dataset for both prediction horizons. Considering Tables 4, 5, and 12, it can be concluded that the TML model predicted BGL better according to the SE metric using multivariate input compared to univariate input in the Ohio_2018 dataset for both 30-minute and 60-minute prediction horizons.

DNN approach
Table 13 displays the p-values obtained from the Wilcoxon test, which was performed based on each evaluation metric and for each prediction horizon, over the data contributors of each cohort. The test was conducted to determine whether univariate or multivariate input could make a significant difference in BGL prediction performance when applying the DNN model. The results showed that with a significance level of five percent, there was no statistically meaningful difference in the DNN model performance in predicting BGL using univariate or multivariate input in both datasets and for both prediction horizons, according to all examined evaluation metrics.

Ease of data
Another important factor to be considered when comparing inputs for the BGL prediction task is ease of data access. It is essential to consider how convenient data collection and preprocessing would be for each input. Developing a BGL prediction model using only data from a CGM sensor, which is a readily accessible tool for T1DM patients, requires automatic data collection with minimum human intervention and facilitates practicality of implementation regarding computational complications. In BGL prediction using a univariate input, there would be no need for extra effort and cost to acquire data from several sensors and modalities 15,16,18–20. Also, multivariate input needs further data preprocessing steps, including data scaling up/down and data alignment. Moreover, according to Table 10, BGL prediction using multivariate input has a higher computational cost. Overall, univariate input is superior to multivariate input in terms of ease of data collection and processing.

Summary
According to the results in "Evaluation results", "Statistical result", and "Ease of data", the following can be concluded. There was no conclusive evidence as to whether the use of univariate or multivariate input achieves better BGL prediction performance. With the CTF model, adding exogenous variables could make BGL predictions worse. In contrast, multivariate input may improve BGL prediction with the TML model, whereas it does not significantly affect the performance of the DNN model. Also, BGL prediction performance was not significantly affected by the choice of univariate or multivariate input in the Ohio_2020 cohort for the three forecasting models and both prediction horizons. Overall, the results reveal that considering exogenous variables, including Carb, Bolus, and activity, despite requiring more effort and cost, does not conclusively make a significant improvement in the performance of BGL prediction. It is important to note that this conclusion is based on the examined naive approaches of including exogenous variables. Applying advanced data fusion approaches may alter the performance of the models and this conclusion.

Conclusion
This work has comprehensively investigated the performance of different data-driven time series forecasting approaches, including CTF, TML, and DNN, as well as the performance of different inputs, including univariate (BGL data only) and multivariate (BGL data along with Carb, Bolus, and activity), to provide insightful findings in the context of BGL prediction. The performance of different prediction approaches and inputs was compared for BGL prediction 30 and 60 min in advance. These investigations were performed using the two cohorts, Ohio_2018 and Ohio_2020, separately. Three prediction models related to the three different time series forecasting approaches were developed. The models were trained with a univariate input, and their counterparts were developed to cope with multivariate input. The different cases were evaluated using regression-based and clinical-based metrics followed by rigorous statistical analyses.
The obtained results showed that all three prediction models performed comparably in most cases. In the remaining cases, the TML model, which was also the fastest model to train, performed significantly better than the CTF model, the DNN model, or both, especially when using multivariate input. Moreover, comparing different inputs for each prediction model showed that adding extra variables, including Carb, Bolus, and activity, and converting the univariate forecasting task to a multivariate one does not necessarily improve the BGL prediction significantly. In fact, different time series forecasting approaches perform differently for predicting BGL when dealing with multivariate data. The CTF model may perform worse when exogenous variables are added, the TML model may perform better using multivariate input, and the DNN model performs similarly using univariate or multivariate input. From the obtained results it can also be inferred that, to use the data of exogenous variables more effectively, information extraction and data fusion approaches may be required. Hence, investigating optimal approaches for the fusion of extra variables with BGL is suggested as future work.
It is worth mentioning that in the current work, we investigated naive multivariate input for incorporating exogenous variables. Therefore, investigating effective approaches for leveraging the affecting variables could be important to make a conclusive decision regarding the input of BGL prediction models. Hence, developing approaches for effectively incorporating exogenous variables would be a future direction. Also, this work focused on data-driven approaches; using physiological models for Carb and Bolus and developing hybrid prediction models are suggested. Moreover, it is worth noting that other potentially superior models for BGL prediction can be used in each forecasting group. Specifically, in the DNN approach, instead of LSTM, more advanced models with superior performance in handling complex temporal patterns (e.g. PatchMixer and SegRNN) could be examined.

Figure 1. A schematic diagram demonstrating the preprocessing steps.

Figure 2. The colour-coded surveillance error grid related to the predictions of the CTF approach with univariate input 30 min in advance for patient 570.

Figure 3. The colour-coded surveillance error grid related to the predictions of the TML approach with univariate input 30 min in advance for patient 570.

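Figure 4. The colour-coded surveillance error grid related to the predictions of the DNN approach with univariate input 30 min in advance for patient 570.

Figure 5. The colour-coded surveillance error grid related to the predictions of the CTF approach with multivariate input 30 min in advance for patient 570.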

Figure 6. The colour-coded surveillance error grid related to the predictions of the TML approach with multivariate input 30 min in advance for patient 570.

Figure 7. The colour-coded surveillance error grid related to the predictions of the DNN approach with multivariate input 30 min in advance for patient 570.

Figure 8. CD diagram comparing the prediction models pairwise with univariate input over the data contributors of the Ohio_2018 dataset for the 60-min prediction horizon, based on the RMSE metric.

Figure 9. CD diagram comparing the prediction models pairwise with univariate input over the data contributors of the Ohio_2018 dataset for the 60-min prediction horizon, based on the MAE metric.

Figure 10. CD diagram comparing the prediction models pairwise with univariate input over the data contributors of the Ohio_2018 dataset for the 60-min prediction horizon, based on the SE metric.

Figure 11. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2018 dataset for the 30-min prediction horizon, based on the RMSE metric.

Figure 12. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2018 dataset for the 30-min prediction horizon, based on the MAE metric.

Figure 13. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2018 dataset for the 30-min prediction horizon, based on the SE metric.

Figure 14. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2018 dataset for the 60-min prediction horizon, based on the RMSE metric.

Figure 15. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2018 dataset for the 60-min prediction horizon, based on the MAE metric.

Figure 16. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2018 dataset for the 60-min prediction horizon, based on the MCC metric.

Figure 17. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2018 dataset for the 60-min prediction horizon, based on the SE metric.

Figure 18. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2020 dataset for the 60-min prediction horizon, based on the RMSE metric.

Figure 19. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2020 dataset for the 60-min prediction horizon, based on the MAE metric.

Figure 20. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2020 dataset for the 60-min prediction horizon, based on the MCC metric.

Figure 21. CD diagram comparing the prediction models pairwise with multivariate input over the data contributors of the Ohio_2020 dataset for the 60-min prediction horizon, based on the SE metric.

Table 1. Information about the gender and age of contributors to the Ohio_2018 and Ohio_2020 datasets. PID, patient identity.

tests. An autoregressive integrated moving average with exogenous variables (ARIMAX) model was used for the multivariate prediction, incorporating exogenous variables into the univariate ARIMA model. Table 2 shows the optimised parameters for each patient when training the ARIMA and ARIMAX models.

Table 2. The optimised parameters for the ARIMA and ARIMAX models. PID, patient identity.

Table 3. The optimised parameters for the SVR model. PID, patient identity; PH, prediction horizon.

Table 4 .
Evaluation results of different prediction approaches and inputs in Ohio_2018 dataset for prediction horizons of 30 min.PID patient identity, PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error, CTF classical time series forecasting, TML traditional machine learning, DNN deep neural network.
Scientific Reports | (2024) 14:21863 | https://doi.org/10.1038/s41598-024-70277-x

Multivariate input

Table 9 shows the Friedman test p-values for each evaluation metric when the different forecasting approaches are used with multivariate input. The test was performed separately for each prediction horizon (30 and 60 min) and for each cohort. The p-values in bold are significant at the five percent level, indicating that at least two prediction models may differ in BGL prediction performance.
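As a rough illustration of how such a p-value arises, the sketch below computes the Friedman statistic from per-contributor ranks for three models. With three models the chi-square approximation has two degrees of freedom, for which the survival function is exactly exp(-Q/2). The error table is hypothetical, ties are not handled, and the chi-square approximation is coarse for a handful of contributors, so this is not necessarily the exact procedure behind Table 9.

```python
import math

def friedman_statistic(table):
    """Friedman chi-square statistic; rows are contributors, columns are models.

    Simplified sketch: raises on ties instead of applying a tie correction.
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        if len(set(row)) != k:
            raise ValueError("ties within a row are not handled in this sketch")
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

def p_value_three_models(q):
    """Chi-square survival function for 2 degrees of freedom (k = 3 models)."""
    return math.exp(-q / 2.0)

# Hypothetical errors for CTF, TML and DNN over six contributors (illustrative only).
errors = [
    [20.1, 19.0, 22.3],
    [18.4, 17.9, 19.5],
    [25.0, 23.2, 26.1],
    [20.5, 20.8, 21.9],
    [19.3, 18.1, 20.4],
    [22.6, 21.5, 24.0],
]

q = friedman_statistic(errors)
p = p_value_three_models(q)
print(f"Friedman Q = {q:.3f}, p = {p:.4f}, significant at 5%: {p < 0.05}")
```

A significant result only says that the models are not all equivalent; identifying which pairs differ requires a post-hoc comparison.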

Table 5. Evaluation results of different prediction approaches and inputs in the Ohio_2018 dataset for the 60-min prediction horizon. PID patient identity, PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error, CTF classical time series forecasting, TML traditional machine learning, DNN deep neural network.

… p-values for comparing different prediction models for the 30-min prediction horizon in the Ohio_2018 dataset based on RMSE, MAE, and SE metrics. The post-hoc Nemenyi test was conducted for each mentioned case to compare the prediction models pairwise. The results of the post-hoc tests are presented graphically in CD diagrams, as shown in Figs. 11–21. A horizontal line connects prediction models that do not differ significantly (at a significance level of five percent). Figures 11 and 12 show that the TML model, while performing similarly to the CTF model, significantly outperformed the DNN model for predicting BGL 30 min in advance in the Ohio_2018 dataset based on RMSE and MAE metrics, respectively. From Figs. 13, 14, and 21 it can be seen that the TML model significantly outperformed both the CTF and DNN models in the Ohio_2018 dataset based on the SE metric for the 30-min …

Table 6. Evaluation results of different prediction approaches and inputs in the Ohio_2020 dataset for the 30-min prediction horizon. PID patient identity, PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error, CTF classical time series forecasting, TML traditional machine learning, DNN deep neural network.

Table 7. Evaluation results of different prediction approaches and inputs in the Ohio_2020 dataset for the 60-min prediction horizon. PID patient identity, PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error, CTF classical time series forecasting, TML traditional machine learning, DNN deep neural network.
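For readers less familiar with the metrics reported in these evaluation tables, the sketch below gives the standard definitions of RMSE, MAE and the binary form of MCC. The glucose values are invented, and the 70 mg/dL threshold used to turn readings into hypoglycaemia labels is an assumption for illustration; the event definition used in the paper is not restated in this section. The surveillance error (SE) is omitted because it depends on a clinical risk grid.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between paired reference and predicted values."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference))

def mae(reference, predicted):
    """Mean absolute error between paired reference and predicted values."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)

def mcc(true_labels, pred_labels):
    """Matthews correlation coefficient for binary labels (True = event)."""
    tp = sum(t and p for t, p in zip(true_labels, pred_labels))
    tn = sum((not t) and (not p) for t, p in zip(true_labels, pred_labels))
    fp = sum((not t) and p for t, p in zip(true_labels, pred_labels))
    fn = sum(t and (not p) for t, p in zip(true_labels, pred_labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

HYPO_THRESHOLD = 70.0  # mg/dL; assumed threshold, for illustration only

# Invented reference and predicted glucose values (mg/dL).
reference = [65, 80, 120, 68, 150, 200]
predicted = [72, 78, 115, 66, 160, 190]

true_events = [r < HYPO_THRESHOLD for r in reference]
pred_events = [p < HYPO_THRESHOLD for p in predicted]

rmse_value = rmse(reference, predicted)
mae_value = mae(reference, predicted)
mcc_value = mcc(true_events, pred_events)
print(f"RMSE={rmse_value:.2f} mg/dL, MAE={mae_value:.2f} mg/dL, MCC={mcc_value:.3f}")
```

RMSE and MAE are error measures where lower is better, whereas MCC is a correlation-type score where higher is better, which matters when ranking models across metrics.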

Table 8. P-values of the Friedman test for comparing all prediction models for univariate BGL prediction 30 and 60 min in advance in the Ohio_2018 and Ohio_2020 datasets. PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error.

Table 10. The average training time (seconds) for models using different approaches across all patients in each cohort for each input and prediction horizon. PH prediction horizon, CTF classical time series forecasting, TML traditional machine learning, DNN deep neural network.

Table 11. P-values of the Wilcoxon test for comparing univariate and multivariate input for the CTF model for BGL prediction 30 and 60 min in advance in the Ohio_2018 and Ohio_2020 datasets. PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error.

Table 12. P-values of the Wilcoxon test for comparing univariate and multivariate input for the TML model for BGL prediction 30 and 60 min in advance in the Ohio_2018 and Ohio_2020 datasets. PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error.

Table 13. P-values of the Wilcoxon test for comparing univariate and multivariate input for the DNN model for BGL prediction 30 and 60 min in advance in the Ohio_2018 and Ohio_2020 datasets. PH prediction horizon, RMSE root mean square error, MAE mean absolute error, MCC Matthews correlation coefficient, SE surveillance error.
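The paired comparison behind these Wilcoxon tables can be illustrated with a small exact signed-rank test. The sketch below enumerates every sign assignment, which is feasible only for small samples, and assumes no zero differences and no tied absolute differences. Whether the paper used the exact or the normal-approximation variant is not stated in this section, and the RMSE values are hypothetical.

```python
from itertools import product

def wilcoxon_exact(a, b):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (w, p) where w = min(W+, W-). Simplified sketch: raises on zero
    differences or tied absolute differences instead of correcting for them.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    magnitudes = [abs(d) for d in diffs]
    if 0 in magnitudes or len(set(magnitudes)) != n:
        raise ValueError("zero or tied differences are not handled in this sketch")
    order = sorted(range(n), key=lambda i: magnitudes[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    max_sum = n * (n + 1) // 2
    w_plus = sum(ranks[i] for i in range(n) if diffs[i] > 0)
    w = min(w_plus, max_sum - w_plus)
    # Under the null hypothesis every rank is equally likely to carry either sign.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(rank for rank, keep in zip(range(1, n + 1), signs) if keep)
        if min(s, max_sum - s) <= w:
            extreme += 1
    return w, extreme / 2 ** n

# Hypothetical per-contributor RMSE for one model with each input type.
univariate = [20.0, 18.5, 25.0, 21.0, 19.5, 22.5]
multivariate = [19.5, 18.75, 24.0, 20.25, 18.0, 20.5]

w, p = wilcoxon_exact(univariate, multivariate)
print(f"W = {w}, two-sided exact p = {p}")
```

With only six pairs the smallest attainable two-sided p-value is 2/64, so a single contributor moving in the opposite direction is enough to push the result above the five percent level.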